171: perf: Optimize st_has(z/m) using WKBBytesExecutor + Implement new WKBHeader#4
171: perf: Optimize st_has(z/m) using WKBBytesExecutor + Implement new WKBHeader#4martin-augment wants to merge 37 commits intomainfrom
Conversation
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
…t expected test results
…dencies from sedona-geometry crate
WalkthroughThe PR adds comprehensive support for WKB dimension inference by introducing a new WkbHeader parser in Rust geometry library, refactoring the STHasZm function to leverage fast-path byte parsing, and extending test coverage for Z and M dimensions across benchmarks and unit tests. Changes
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
Pull Request Review: PR #4 - Optimize ST_Has(Z/M) using WKBBytesExecutor + Implement new WKBHeaderOverviewThis PR optimizes the ✅ StrengthsPerformance Optimization
Test Coverage
Code Structure
🔴 Critical Issues1. Debug Print Statements Left in Production CodeLocation: Lines 68, 70, 83, 89, 90, 91, 243, 247 contain // Line 68-70
println!("geometry_type_id: {geometry_type_id:?}");
println!("size: {size}");
println!("buf: {:?}", &wkb_buffer.buf[wkb_buffer.offset..]);
// Line 83
println!("starting parse_dimensions buf: {:?}", &buf[i..]);
// Line 89-91
println!("first_xy_coord: buf: {:?}", &wkb_buffer.buf[wkb_buffer.offset..]);
// Line 243, 247
println!("first_geom_idx: buf: {:?}", &self.buf[self.offset..]);
println!("geometry_type_id: {geometry_type_id:?}");Action Required: Remove all 2. Extensive Commented-Out CodeLocation: Lines 54-141, 235-242, 300-373, 399-402, 432-436, 481-485 contain large blocks of commented-out code. This significantly reduces code readability and maintainability. Action Required: Remove all commented-out code. If you need to reference the old implementation, it's available in git history. 3. TODO Comments in Production CodeLocation: Lines 77, 79, 84, 88 contain "TODO: clean up later" comments: let mut wkb_buffer = WkbBuffer::new(buf); // Reset: TODO: clean up laterIssue: These TODOs suggest incomplete refactoring. The current approach creates multiple Recommendation: Either:
|
|
|
||
| /// Returns the top-level dimension of the WKB | ||
| pub fn dimensions(&self) -> Result<Dimensions, SedonaGeometryError> { | ||
| let dimensions = match self.geometry_type / 1000 { |
There was a problem hiding this comment.
Consider masking out the SRID flag before deriving dimensions; using self.geometry_type / 1000 will misclassify EWKB with SRID set (unexpected code). This also applies to parse_dimensions() where code / 1000 is used.
🤖 React with 👍 or 👎 to let us know if the comment was useful.
| // )); | ||
| // } | ||
| // let num_geometries = self.read_u32(&buf[5..9], byte_order)?; | ||
| let num_geometries = self.read_u32()?; |
There was a problem hiding this comment.
For collections, the SRID (EWKB) flag should be handled before reading the count; here num_geometries is read before skipping SRID, so with SRID set this reads SRID instead of the count and throws off offsets.
🤖 React with 👍 or 👎 to let us know if the comment was useful.
| if let Some(i) = first_geom_idx { | ||
| // first_geom_dimensions = Some(self.parse_dimensions(&buf[i..])?); | ||
|
|
||
| println!("starting parse_dimensions buf: {:?}", &buf[i..]); |
There was a problem hiding this comment.
Leftover println! debug statements (e.g., here and below) will pollute stdout during normal runs; consider removing or gating them behind a debug flag. (also applies to other locations in this file)
🤖 React with 👍 or 👎 to let us know if the comment was useful.
|
Findings
Suggested follow-up
|
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rust/sedona-functions/src/st_haszm.rs (1)
163-233: Add tests for Z/M in non-first child to verify the current behavior.The
invoke_scalarfunction at lines 110-135 only inspects the first geometry in a collection viaheader.first_geom_dimensions(). This creates a potential gap: aGEOMETRYCOLLECTION (POINT (0 0), POINT Z (0 0 1))would incorrectly returnfalseforST_HASZ()since only the first child (which lacks Z) is examined.Add these test cases to ensure the implementation's handling of multi-child collections is intentional and correct:
m_tester.assert_scalar_result_equals(result, ScalarValue::Boolean(Some(true))); + // Non-first child carries Z: verify current behavior + let result = z_tester + .invoke_wkb_scalar(Some("GEOMETRYCOLLECTION (POINT (0 0), POINT Z (0 0 1))")) + .unwrap(); + z_tester.assert_scalar_result_equals(result, ScalarValue::Boolean(Some(true))); + + // Non-first child carries M: verify current behavior + let result = m_tester + .invoke_wkb_scalar(Some("GEOMETRYCOLLECTION (POINT (0 0), POINT M (0 0 1))")) + .unwrap(); + m_tester.assert_scalar_result_equals(result, ScalarValue::Boolean(Some(true)));
🧹 Nitpick comments (9)
rust/sedona-testing/src/fixtures.rs (1)
31-44: Fixture looks correct; bytes match a MultiPoint(XY) with a nested PointZ(1,2,3).Nice targeted case for mixed-dimension inference.
Optionally mention EWKB thousand-offset (1001 = Point Z) in the doc comment for quicker auditing.
python/sedonadb/tests/functions/test_functions.py (1)
564-571: Add a case where Z/M appears in a non-first child.E.g., GEOMETRYCOLLECTION (POINT (0 0), POINT Z (0 0 1)) should be True for ST_HasZ; similarly for M. This guards against first-child-only inference regressions.
rust/sedona-geometry/src/lib.rs (1)
27-27: Public exposure of wkb_header looks good.Consider adding top-level module docs and (optionally) re-exporting key types (e.g., pub use wkb_header::WkbHeader) for discoverability.
rust/sedona-functions/src/st_haszm.rs (1)
109-110: Nit: comment mismatch.This returns a boolean, not a “geometry type name”. Update the function comment.
rust/sedona-geometry/src/wkb_header.rs (5)
162-169: Base type extraction: prefer masking thousands over& 0x7for clarity.
& 0x7works by coincidence (codes 1..7), butbase = (geometry_type & !SRID_FLAG_BIT) % 1000communicates intent and avoids surprises if additional flags appear.Apply:
- // Only low 3 bits is for the base type, high bits include additional info - let code = self.geometry_type & 0x7; + // Strip flags and dimension offsets, then take base WKB type in 1..7 + let code = (self.geometry_type & !SRID_FLAG_BIT) % 1000;
77-99: Avoid triple buffer re-inits and remove debug prints.
try_newrecreatesWkbBuffer3x and prints raw buffers. Streamline by computingi = first_geom_idxonce and reusing a single subslice; gate logging behind#[cfg(debug_assertions)]orlogat debug.Apply:
- let mut wkb_buffer = WkbBuffer::new(buf); // Reset: TODO: clean up later - let first_geom_idx = wkb_buffer.first_geom_idx()?; // ERROR HERE + let mut wkb_buffer = WkbBuffer::new(buf); + let first_geom_idx = wkb_buffer.first_geom_idx()?; if let Some(i) = first_geom_idx { - println!("starting parse_dimensions buf: {:?}", &buf[i..]); - let mut wkb_buffer = WkbBuffer::new(&buf[i..]); // Reset: TODO: clean up later + let mut wkb_buffer = WkbBuffer::new(&buf[i..]); first_geom_dimensions = Some(wkb_buffer.parse_dimensions()?); - let mut wkb_buffer = WkbBuffer::new(&buf[i..]); // Reset: TODO: clean up later - println!( - "first_xy_coord: buf: {:?}", - &wkb_buffer.buf[wkb_buffer.offset..] - ); + let mut wkb_buffer = WkbBuffer::new(&buf[i..]); (first_x, first_y) = wkb_buffer.first_xy_coord()?; } else { first_geom_dimensions = None; first_x = f64::NAN; first_y = f64::NAN; }And remove/guard other
println!calls in this file.
241-248: Strip debug prints from hot path.
println!in parsing paths will flood stdout and hurt perf. Replace withlog::debug!or guard with#[cfg(debug_assertions)].Example:
- println!("first_geom_idx: buf: {:?}", &self.buf[self.offset..]); + #[cfg(debug_assertions)] + eprintln!("first_geom_idx: buf: {:?}", &self.buf[self.offset..]); @@ - println!("geometry_type_id: {geometry_type_id:?}"); + #[cfg(debug_assertions)] + eprintln!("geometry_type_id: {geometry_type_id:?}");Also applies to: 270-276
62-67: Clarifysizesemantics for Point or returnOption<u32>.
size()returns1for Point but docs say “Not applicable for a point”. Consider returningOption<u32>or document “Point always returns 1”.Possible API tweak:
- Keep
u32and specify: “For Point, returns 1.”- Or change to
Option<u32>whereNone= not applicable.Also applies to: 171-178
507-885: Great test coverage; add SRID/EWKB cases to lock the fixes.Current tests skip SRID. Add minimal EWKB byte fixtures to assert
dimensions(),size(), andfirst_xy()with SRID present for single and collection geometries.I can contribute a small helper to inject SRID into WKB:
fn with_srid(mut wkb: Vec<u8>, srid: u32) -> Vec<u8> { // wkb[0] = endian, wkb[1..5] = type; set SRID flag and insert srid after type let endian = wkb[0]; let mut ty = u32::from_le_bytes(wkb[1..5].try_into().unwrap()); // writer defaults to LE ty |= SRID_FLAG_BIT; wkb[1..5].copy_from_slice(&ty.to_le_bytes()); wkb.splice(5..5, srid.to_le_bytes()); // insert SRID at byte 5 wkb }Then test e.g.
with_srid(make_wkb("MULTIPOINT ((1 2))"), 4326)to verify bothdimensions()and collection parsing.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
benchmarks/test_functions.py(1 hunks)python/sedonadb/tests/functions/test_functions.py(1 hunks)rust/sedona-functions/src/st_haszm.rs(6 hunks)rust/sedona-geometry/src/lib.rs(1 hunks)rust/sedona-geometry/src/wkb_header.rs(1 hunks)rust/sedona-testing/src/fixtures.rs(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
rust/sedona-functions/src/st_haszm.rs (1)
rust/sedona-geometry/src/wkb_header.rs (6)
new(222-229)try_new(45-150)dimensions(191-205)dimensions(778-794)first_geom_dimensions(208-210)first_geom_dimensions(834-852)
benchmarks/test_functions.py (3)
python/sedonadb/python/sedonadb/testing.py (7)
SedonaDBSingleThread(347-353)PostGISSingleThread(618-624)DuckDBSingleThread(407-412)execute_and_collect(156-168)execute_and_collect(342-344)execute_and_collect(403-404)execute_and_collect(533-537)python/sedonadb/tests/functions/test_functions.py (2)
test_st_hasm(601-603)test_st_hasz(574-576)benchmarks/test_bench_base.py (1)
_get_eng(147-161)
🪛 Ruff (0.14.1)
benchmarks/test_functions.py
184-184: Possible SQL injection vector through string-based query construction
(S608)
202-202: Possible SQL injection vector through string-based query construction
(S608)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: claude-review
- GitHub Check: docs-and-deploy
- GitHub Check: python ubuntu-latest
- GitHub Check: codex
- GitHub Check: docs
- GitHub Check: build
- GitHub Check: test
- GitHub Check: ubuntu-latest (release)
- GitHub Check: windows-latest (release)
- GitHub Check: clippy
- GitHub Check: macos-latest (release)
🔇 Additional comments (6)
benchmarks/test_functions.py (2)
170-205: Bench additions LGTM; consistent with existing style.Engines/tables and query harness match the suite.
183-185: Static analysis S608 is low-risk here.Tables come from a fixed enum; no untrusted input. Nothing to change.
If you later bench user-provided table names, prefer validated identifiers or prepared statements where supported.
Also applies to: 201-203
rust/sedona-functions/src/st_haszm.rs (3)
92-106: Good move to WkbBytesExecutor; minimizes allocation and copies.Capacity pre-sizing and null handling are correct.
195-233: Nice coverage for empty collections with Z/M flags and mixed-level Z/M on collections.These validate the intended inference shortcuts.
Please add a case where Z/M is present in a non-first child to prevent regressions (see diff below).
235-249: Great: inferred Z in MultiPoint with mixed header nesting is validated.Complements the fast path well.
rust/sedona-geometry/src/wkb_header.rs (1)
297-377: Coordinate parsing is correct; XY subset reads fine for XYZ/M/ZM.Logic correctly handles empty cases and consumes only X,Y for higher dimensions. No change required.
If you want, I can add tests for
LINESTRING Z/M/ZMandPOLYGON Z/M/ZMcovering empty/non-empty first rings.
| /// Fast-path inference of geometry type name from raw WKB bytes | ||
| fn invoke_scalar(buf: &[u8], dim_index: usize) -> Result<Option<bool>> { | ||
| let header = WkbHeader::try_new(buf).map_err(|e| DataFusionError::External(Box::new(e)))?; | ||
| let top_level_dimensions = header | ||
| .dimensions() | ||
| .map_err(|e| DataFusionError::External(Box::new(e)))?; | ||
|
|
||
| // Infer dimension based on first coordinate dimension for cases where it differs from top-level | ||
| // e.g GEOMETRYCOLLECTION (POINT Z (1 2 3)) | ||
| let dimensions; | ||
| if let Some(first_geom_dimensions) = header.first_geom_dimensions() { | ||
| dimensions = first_geom_dimensions; | ||
| } else { | ||
| dimensions = top_level_dimensions; | ||
| } | ||
|
|
||
| if dim_index == 2 { | ||
| return Ok(Some(matches!( | ||
| dimensions, | ||
| Dimensions::Xyz | Dimensions::Xyzm | ||
| ))); | ||
| } | ||
| if dim_index == 3 { | ||
| return Ok(Some(matches!( | ||
| dimensions, | ||
| Dimensions::Xym | Dimensions::Xyzm | ||
| ))); | ||
| } | ||
| Ok(Some(false)) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
Correctness risk: only inspecting the first nested geometry can yield false negatives.
For collections/multis where the first child is XY but a later child is Z/M, this returns False even though the geometry “has Z/M”. Example: GEOMETRYCOLLECTION (POINT (0 0), POINT Z (0 0 1)).
- Keep the current fast path, but when it evaluates to False for collections/multis, fall back to a lightweight scan of all nested headers (still bytes-only) and short-circuit on the first Z/M match.
- If iterating headers isn’t available yet, add a slow-path fallback (e.g., previous WkbExecutor) only in these uncertain cases to retain semantics without regressing perf on the common path.
Minimal sketch:
fn invoke_scalar(buf: &[u8], dim_index: usize) -> Result<Option<bool>> {
- let header = WkbHeader::try_new(buf).map_err(|e| DataFusionError::External(Box::new(e)))?;
+ let header = WkbHeader::try_new(buf).map_err(|e| DataFusionError::External(Box::new(e)))?;
let top_level_dimensions = header
.dimensions()
.map_err(|e| DataFusionError::External(Box::new(e)))?;
let dimensions = header.first_geom_dimensions().unwrap_or(top_level_dimensions);
let want = if dim_index == 2 {
matches!(dimensions, Dimensions::Xyz | Dimensions::Xyzm)
} else if dim_index == 3 {
matches!(dimensions, Dimensions::Xym | Dimensions::Xyzm)
} else {
false
};
- Ok(Some(want))
+ if want {
+ return Ok(Some(true));
+ }
+
+ // Optional: only for collection/multi types with potential for nested dims
+ // Pseudocode—requires iterator support from WkbHeader/WkbBuffer:
+ // for child in header.iter_children(buf)? {
+ // let dims = child.dimensions()?;
+ // if (dim_index == 2 && matches!(dims, Dimensions::Xyz | Dimensions::Xyzm))
+ // || (dim_index == 3 && matches!(dims, Dimensions::Xym | Dimensions::Xyzm)) {
+ // return Ok(Some(true));
+ // }
+ // }
+ Ok(Some(false))
}Add tests to cover non-first-child Z/M (see separate test suggestion below).
I'll verify this review comment by examining the actual implementation and related code to confirm whether the correctness risk is genuine.
🏁 Script executed:
#!/bin/bash
# Find the st_haszm.rs file and view its full content
fd -t f 'st_haszm\.rs' --exec cat {}Length of output: 9008
🏁 Script executed:
#!/bin/bash
# Search for WkbHeader definition and first_geom_dimensions method
rg -n 'fn first_geom_dimensions' -A 5 -B 2Length of output: 1300
🏁 Script executed:
#!/bin/bash
# Search for tests related to st_haszm to see what edge cases are covered
fd -t f 'test.*haszm|haszm.*test' -i --exec cat {}Length of output: 0
🏁 Script executed:
#!/bin/bash
# Check for any geometry collection test cases with mixed dimensions
rg -n 'GEOMETRYCOLLECTION.*POINT.*Z|hasZ.*Collection|hasMember' -i --type rsLength of output: 26
🏁 Script executed:
#!/bin/bash
# Look for WkbHeader implementation and available methods
fd -t f '.*header.*\.rs|.*wkb.*\.rs' -i | head -20 | xargs grep -l 'struct WkbHeader\|impl.*WkbHeader' 2>/dev/nullLength of output: 38
🏁 Script executed:
#!/bin/bash
# Search for any iteration over children in WkbHeader or similar structures
rg -n 'iter.*child|for.*child|children\(' --type rs | head -20Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check if there are any other hasZ/hasM implementations or related logic
rg -n 'hasZ|hasM|Xyzm|first_geom' --type rs -B 2 -A 2 | head -40Length of output: 26
🏁 Script executed:
#!/bin/bash
# Search for the dimensions method and Dimensions enum
rg -n 'enum Dimensions|pub fn dimensions' --type rs -A 5Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for test files in the same directory
fd -t f '.*test.*\.rs|.*_test\.rs' rust/sedona-functions/src/ --exec ls -lh {}Length of output: 0
🏁 Script executed:
#!/bin/bash
# Check what WKB types are considered collections/multis
rg -n 'Multi|Collection' --type rs | grep -i 'enum\|type\|const' | head -20Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for the exact invoke_scalar function and any related logic
rg -n 'fn invoke_scalar' -A 40 --type rsLength of output: 26
🏁 Script executed:
#!/bin/bash
# Check git history or comments around this function
rg -n 'first child|nested|collection' --type rs -i -B 2 -A 2 | head -50Length of output: 26
🏁 Script executed:
#!/bin/bash
# Search for WkbExecutor to understand the fallback mechanism mentioned
rg -n 'WkbExecutor|invoke_array' --type rs -B 2 -A 5 | head -40Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check if there's logic that handles MultiGeometry or GeometryCollection specially
rg -n 'GeometryCollection|Multi.*Type|0xF|0x7|WkbType' --type rs -C 2 | head -60Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for any safeguards or special handling in dimensions parsing
rg -n 'pub fn dimensions|fn dimensions' --type rs -A 15 | head -60Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check the structure and fields of WkbHeader
rg -n 'struct WkbHeader' -A 20 --type rsLength of output: 26
🏁 Script executed:
#!/bin/bash
# Look for the implementation details of first_geom_dimensions and top-level dimensions
rg -n 'first_geom_dimensions|top_level|nested_header' --type rs -B 3 -A 10Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check if there's any recursive or iterative scanning logic already in place
rg -n 'loop|while|for.*in' --type rs --context=5 | grep -A 8 -B 2 'child\|geom\|header' | head -50Length of output: 26
🏁 Script executed:
#!/bin/bash
# Verify the actual test file if it exists
find rust/sedona-functions/src -name '*test*' -o -name '*tests*' | xargs ls -lh 2>/dev/nullLength of output: 1130
🏁 Script executed:
#!/bin/bash
# Look for mod tests or #[cfg(test)] blocks in st_haszm
rg -n '#\[cfg\(test\)\]|mod tests|#\[test\]' rust/sedona-functions/src/st_haszm.rs -A 30Length of output: 2448
🏁 Script executed:
#!/bin/bash
# Check if there are integration tests that might cover this
find rust/sedona-functions -name '*test*.rs' -type f | xargs grep -l 'hasZ\|haszm\|POINT.*Z' 2>/dev/nullLength of output: 0
🏁 Script executed:
#!/bin/bash
# Look for the full context around the invoke_scalar function to understand how it's called
rg -n 'fn invoke_scalar|fn invoke_array' --type rs -B 5 -A 50 | head -80Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check if WkbBuffer or similar has methods for iterating through geometry components
rg -n 'impl.*WkbBuffer|struct WkbBuffer|fn.*children|fn.*iter' --type rs -A 8 | head -60Length of output: 26
🏁 Script executed:
#!/bin/bash
# Summarize WkbHeader available methods
rg -n 'impl.*WkbHeader' -A 100 --type rs | head -120Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check the actual geometry type codes to understand collection detection
rg -n 'GeometryType|WkbGeomType|0x07|0x08|MULTI|COLLECTION' --type rs | head -30Length of output: 26
🏁 Script executed:
#!/bin/bash
# Search for any dimension array scanning or all-children checking logic
rg -n 'scan|all.*dim|iterate|traverse' --type rs -i | grep -v test | head -20Length of output: 26
🏁 Script executed:
#!/bin/bash
# Final check: look at the exact line range mentioned in the review
sed -n '109,138p' rust/sedona-functions/src/st_haszm.rsLength of output: 1022
🏁 Script executed:
#!/bin/bash
# Check what top_level_dimensions actually returns and how it differs from first_geom_dimensions
rg -n 'top_level|first_geom_dimensions|dimensions\(' --type rs -B 3 -A 8 | head -80Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look at the complete WkbHeader structure and all its methods
ast-grep --pattern 'impl WkbHeader { $$$ }'Length of output: 12874
🏁 Script executed:
#!/bin/bash
# Check for GeometryCollection or Multi type detection at the top level
rg -n 'geometry_type|wkb_type|enum.*Wkb.*Type' --type rs | head -30Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for the exact semantics of what the function should return
rg -n 'hasZ|hasM|dimensions' rust/sedona-functions/src/st_haszm.rs -B 2 -A 2Length of output: 1297
🏁 Script executed:
#!/bin/bash
# Check if there's documentation about the expected behavior
rg -n '//.*[Zz].*dimension|//.*collection|//.*nested' rust/sedona-functions/src/st_haszm.rsLength of output: 361
🏁 Script executed:
#!/bin/bash
# Look for any comment that explains the choice to use first_geom_dimensions
rg -n 'first' rust/sedona-functions/src/st_haszm.rs -B 5 -A 5Length of output: 652
🏁 Script executed:
#!/bin/bash
# Check for similar functions that might handle this differently
rg -n 'fn.*haszm|fn.*hasm|fn.*dims' --type rs -i | head -20Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look at imports and dependencies to understand WkbHeader source
head -40 rust/sedona-functions/src/st_haszm.rsLength of output: 1562
🏁 Script executed:
#!/bin/bash
# Search for WkbHeader in the broader codebase to find its definition
find rust -name '*.rs' -type f | xargs grep -l 'struct WkbHeader' | head -5Length of output: 38
🏁 Script executed:
#!/bin/bash
# Get WkbHeader definition file and content
fd 'lib\.rs|mod\.rs' rust/sedona-geos/src --exec grep -l 'WkbHeader' {} \; | head -3 | xargs cat 2>/dev/null | head -100Length of output: 107
🏁 Script executed:
#!/bin/bash
# More targeted search for WkbHeader
rg -n 'pub struct WkbHeader|pub fn first_geom_dimensions|pub fn dimensions' --type rs -A 15Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check if there's a geometry collection type discriminator
rg -n 'GeometryCollection\|MultiPoint\|MultiLineString\|MultiPolygon|0x07|0x08|0x0A|0x0B|0x0C' --type rs -B 2 -A 2 | head -50Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look at the entire st_haszm.rs file to get full picture
wc -l rust/sedona-functions/src/st_haszm.rs && echo "---" && cat rust/sedona-functions/src/st_haszm.rsLength of output: 9054
🏁 Script executed:
#!/bin/bash
# Check all files in the same directory
ls -la rust/sedona-functions/src/ | head -30Length of output: 1863
🏁 Script executed:
#!/bin/bash
# Search for wkb-related modules
find rust -type f -name '*.rs' | xargs grep -l 'WkbHeader\|WkbBuffer' | head -10Length of output: 76
🏁 Script executed:
#!/bin/bash
# Look at WkbHeader in sedona-core or common location
rg -n 'pub struct WkbHeader\|pub fn first_geom_dimensions' -A 25 --type rsLength of output: 26
🏁 Script executed:
#!/bin/bash
# Check the real behavior by looking at how first_geom_dimensions is used elsewhere
rg -n 'first_geom_dimensions' --type rs -B 3 -A 3Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for the specific test case mentioned in the review
rg -n 'GEOMETRYCOLLECTION.*POINT.*0.*0.*POINT.*Z.*0.*0.*1' --type rs -iLength of output: 26
🏁 Script executed:
#!/bin/bash
# Check for any test about mixed-dimension geometries
rg -n 'mixed|different.*dim|varying' rust/sedona-functions/src/st_haszm.rs -iLength of output: 0
🏁 Script executed:
#!/bin/bash
# Search for test modules
ast-grep --pattern '#[cfg(test)] mod $_ { $$$ }'Length of output: 0
🏁 Script executed:
#!/bin/bash
# Look for all test functions in related files
rg -n '#\[test\]' --type rs -A 3 | grep -E 'hasz|hasm|dimension' -i -B 1 -A 3Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check the structure around invoke_scalar in context
sed -n '90,150p' rust/sedona-functions/src/st_haszm.rsLength of output: 1876
🏁 Script executed:
#!/bin/bash
# Final verification: check what WkbHeader methods exist
rg -n 'pub fn' rust/sedona-functions/src/ --type rs | grep -i 'header\|dimension'Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for all references to this function to understand its usage
rg -n 'invoke_scalar' --type rs -B 2 -A 2Length of output: 26
🏁 Script executed:
#!/bin_bash
# Check for geometry collection handling in invoke_array
rg -n 'invoke_array' --type rs -A 80 | head -100Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for the complete picture of how dimensions are handled for collections
rg -n 'pub fn dimensions\|pub fn first_geom_dimensions\|pub fn top_level_dimensions' --type rs -B 2 -A 20Length of output: 26
🏁 Script executed:
#!/bin/bash
# Understand the actual file structure
find rust -name '*.rs' | xargs grep -l 'struct WkbHeader' -A 20 | head -1 | xargs cat | head -150Length of output: 5370
🏁 Script executed:
#!/bin/bash
# Check the entire sedona-common or sedona-core crate structure
find rust -name 'lib.rs' | xargs cat | grep -A 30 'WkbHeader\|dimensions' | head -100Length of output: 0
🏁 Script executed:
#!/bin/bash
# Try a different approach - find where WkbHeader is defined
find rust -name '*.rs' -exec grep -l 'pub struct WkbHeader' {} \;Length of output: 38
🏁 Script executed:
#!/bin/bash
# Get the full st_haszm.rs file and WkbHeader related code
cat rust/sedona-functions/src/st_haszm.rsLength of output: 9008
🏁 Script executed:
#!/bin/bash
# Find WkbHeader definition
find rust -name '*.rs' -type f -exec grep -l 'pub struct WkbHeader\|impl WkbHeader' {} \;Length of output: 38
🏁 Script executed:
#!/bin/bash
# Search for first_geom_dimensions implementation and usage
rg -n 'first_geom_dimensions' --type rs -B 5 -A 10Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for test files related to haszm or dimensions
rg -n '#\[test\].*haszm|#\[test\].*dimension' --type rs -i -A 15Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check if geometry collection types are handled specially anywhere
rg -n 'GeometryCollection\|MultiPoint|MultiLineString|0xF|wkb.*type' --type rs -i -B 2 -A 2 | head -60Length of output: 26
🏁 Script executed:
#!/bin/bash
# Search for invoke_array to understand the fallback mechanism
rg -n 'invoke_array|fn invoke' --type rs -B 3 -A 20 | head -80Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for any child iteration or nested geometry scanning
rg -n 'children\(|iter_geom|for.*child|while.*byte' --type rs -B 3 -A 5Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check what top_level_dimensions returns vs first_geom_dimensions
rg -n 'fn dimensions\(|pub fn dimensions' --type rs -A 15 | head -50Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for geometry collection tests specifically
rg -n 'test.*collection|geometrycollection' --type rs -i -A 10 | head -60Length of output: 26
🏁 Script executed:
#!/bin/bash
# Check related WKB handling code
find rust -path '*/wkb*' -o -path '*/common*' -name '*.rs' | head -10 | xargs ls -lhLength of output: 469
🏁 Script executed:
#!/bin/bash
# Get dimensions enum and related type info
rg -n 'enum Dimensions|pub enum.*Dim' --type rs -A 8Length of output: 26
🏁 Script executed:
#!/bin/bash
# Look for comments explaining the logic around first geometry
rg -n 'first.*geom|first.*child|nested.*dim' rust/sedona-functions/src/st_haszm.rs -B 2 -A 2Length of output: 265
🏁 Script executed:
#!/bin/bash
# Check the entire invoke_scalar implementation in detailed context
sed -n '100,145p' rust/sedona-functions/src/st_haszm.rsLength of output: 1372
Issue verified: false negatives for geometry collections with Z/M only in non-first children.
The concern is valid. The function uses first_geom_dimensions(), which only inspects the first nested geometry. For a collection like GEOMETRYCOLLECTION (POINT (0 0), POINT Z (0 0 1)), it would return false for hasZ even though the geometry does contain Z. Tests confirm this gap—existing tests only cover cases where Z/M appears in the first child or at the top level.
To fix, implement the suggested approach: keep the fast path, but when it returns false for collection/multi types, scan all nested headers to verify no Z/M exists anywhere. This requires either:
- Adding child iteration support to
WkbHeader/WkbBuffer(bytes-only, no full parsing), or - Falling back to a slower full-geometry inspection only for uncertain cases
Add test case covering GEOMETRYCOLLECTION (POINT (0 0), POINT Z (0 0 1)) to prevent regression.
🤖 Prompt for AI Agents
In rust/sedona-functions/src/st_haszm.rs around lines 109 to 138, the fast-path
uses header.first_geom_dimensions() which misses Z/M in non-first children
(e.g., GEOMETRYCOLLECTION (POINT (0 0), POINT Z (0 0 1))) causing false
negatives; keep the current fast-path but when it would return false for
collection/multi types, scan all nested child headers to detect any Z/M and only
then return false—implement this by adding bytes-only child-header iteration to
WkbHeader/WkbBuffer (preferred) or, if that is not available, fall back to a
slower full-geometry inspection for those uncertain cases; also add a test for
GEOMETRYCOLLECTION (POINT (0 0), POINT Z (0 0 1)) to prevent regression.
| /// Returns the top-level dimension of the WKB | ||
| pub fn dimensions(&self) -> Result<Dimensions, SedonaGeometryError> { | ||
| let dimensions = match self.geometry_type / 1000 { | ||
| 0 => Dimensions::Xy, | ||
| 1 => Dimensions::Xyz, | ||
| 2 => Dimensions::Xym, | ||
| 3 => Dimensions::Xyzm, | ||
| _ => { | ||
| return Err(SedonaGeometryError::Invalid(format!( | ||
| "Unexpected code: {}", | ||
| self.geometry_type | ||
| ))) | ||
| } | ||
| }; | ||
| Ok(dimensions) | ||
| } |
There was a problem hiding this comment.
Mask EWKB SRID bit when deriving dimensions (current logic breaks with SRID).
dimensions() and parse_dimensions() divide the raw type by 1000 without clearing SRID_FLAG_BIT. With EWKB SRID set, code / 1000 yields a huge number, not 0..3. Fix by masking the flag first.
Apply:
@@
- pub fn dimensions(&self) -> Result<Dimensions, SedonaGeometryError> {
- let dimensions = match self.geometry_type / 1000 {
+ pub fn dimensions(&self) -> Result<Dimensions, SedonaGeometryError> {
+ let code = self.geometry_type & !SRID_FLAG_BIT;
+ let dimensions = match code / 1000 {
0 => Dimensions::Xy,
1 => Dimensions::Xyz,
2 => Dimensions::Xym,
3 => Dimensions::Xyzm,
_ => {
return Err(SedonaGeometryError::Invalid(format!(
- "Unexpected code: {}",
- self.geometry_type
+ "Unexpected code: {}",
+ code
)))
}
};
Ok(dimensions)
}
@@
- fn parse_dimensions(&mut self) -> Result<Dimensions, SedonaGeometryError> {
+ fn parse_dimensions(&mut self) -> Result<Dimensions, SedonaGeometryError> {
self.read_endian()?;
- let code = self.read_u32()?;
- match code / 1000 {
+ let mut code = self.read_u32()?;
+ code &= !SRID_FLAG_BIT;
+ match code / 1000 {
0 => Ok(Dimensions::Xy),
1 => Ok(Dimensions::Xyz),
2 => Ok(Dimensions::Xym),
3 => Ok(Dimensions::Xyzm),
_ => {
return Err(SedonaGeometryError::Invalid(format!(
"Unexpected code: {code:?}"
)))
}
}
}Also applies to: 479-505
🤖 Prompt for AI Agents
In rust/sedona-geometry/src/wkb_header.rs around lines 190 to 205 (and likewise
for the parse_dimensions code at lines ~479-505), the dimensions() function
currently divides self.geometry_type by 1000 without clearing the EWKB SRID flag
bit, which causes huge unexpected values when SRID is present; fix by first
masking out the SRID_FLAG_BIT from the raw type (e.g., let code =
self.geometry_type & !SRID_FLAG_BIT) and then divide or otherwise derive the
dimension code from that masked value, returning the appropriate Dimensions
variants and preserving existing error handling.
| // let num_geometries = self.read_u32(&buf[5..9], byte_order)?; | ||
| let num_geometries = self.read_u32()?; | ||
|
|
||
| if num_geometries == 0 { | ||
| return Ok(None); | ||
| // return Ok(Some(0)) | ||
| } | ||
|
|
||
| // let mut i = 9; | ||
| if geometry_type & SRID_FLAG_BIT != 0 { | ||
| // i += 4; | ||
| // Skip the SRID | ||
| self.read_u32()?; | ||
| } | ||
|
|
There was a problem hiding this comment.
SRID must be skipped before reading collection count; current order mis-parses EWKB.
In first_geom_idx, you read num_geometries before skipping SRID, so with EWKB the SRID is misread as the count and offsets drift.
Apply:
- // let num_geometries = self.read_u32(&buf[5..9], byte_order)?;
- let num_geometries = self.read_u32()?;
-
- if num_geometries == 0 {
- return Ok(None);
- }
-
- // let mut i = 9;
- if geometry_type & SRID_FLAG_BIT != 0 {
- // i += 4;
- // Skip the SRID
- self.read_u32()?;
- }
+ // Skip SRID if present before reading the count
+ if geometry_type & SRID_FLAG_BIT != 0 {
+ self.read_u32()?; // SRID
+ }
+ // Now read the number of geometries
+ let num_geometries = self.read_u32()?;
+ if num_geometries == 0 {
+ return Ok(None);
+ }
// Recursive call to get the first geom of the first nested geometry
let mut nested_buffer = WkbBuffer::new(&self.buf[self.offset..]);
let off = nested_buffer.first_geom_idx()?;
if let Some(off) = off {
Ok(Some(self.offset + off))
} else {
Ok(None)
}Also applies to: 280-288
🤖 Prompt for AI Agents
In rust/sedona-geometry/src/wkb_header.rs around lines 262-276 (and similarly
280-288), the code reads num_geometries before skipping the optional EWKB SRID,
causing the SRID to be interpreted as the collection count and corrupting
offsets; change the order so that if geometry_type & SRID_FLAG_BIT != 0 you call
self.read_u32()? to consume/skip the SRID first, and only then call
self.read_u32()? to read num_geometries (apply the same reorder in the other
block at 280-288).
value:good-to-have; category:bug; feedback:This is a Draft PR! The println statements will be removed for the final version! |
value:good-to-have; category:bug; feedback:This is a Draft PR! The commented out code will be either enabled or removed for the final version! |
value:good-to-have; category:bug; feedback:This is a Draft PR! The TODOs will be addressed for the final version! |
|
The target PR is a Draft and it seems it is not ready for review yet. |
171: To review by AI